Abstract: In this paper we present a new approach for incorporating kernels into dictionary learning. The kernel K-SVD algorithm (KKSVD), which was introduced recently, improves classification performance relative to its linear counterpart, K-SVD. However, this algorithm requires the storage and handling of a very large kernel matrix, which leads to high computational cost and limits its use to setups with a small number of training examples. We address these problems by combining two ideas: first, we approximate the kernel matrix from a cleverly sampled subset of its columns using the Nyström method; second, as we wish to avoid using this matrix altogether, we decompose it by SVD to form new “virtual samples”, on which any linear dictionary learning can be employed. Our method, termed “Linearized Kernel Dictionary Learning” (LKDL), can be seamlessly applied as a pre-processing stage on top of any efficient off-the-shelf dictionary learning scheme, effectively “kernelizing” it. We demonstrate the effectiveness of our method on several tasks of both supervised and unsupervised classification and show the efficiency of the proposed scheme, its easy integration and performance boosting properties.

Index Terms: Dictionary Learning, Supervised Dictionary Learning, Kernel Dictionary Learning, Kernels, KSVD.


I. Introduction 

The field of sparse representations has witnessed great success in an array of applications in signal and image processing. The basic operation in sparse representations is called “sparse coding”, which involves the reconstruction of the signals of interest using a sparse set of building blocks, referred to as “atoms”. The atoms are gathered in a structure called the “dictionary”, which can be manually crafted to contain mathematical functions that have proven successful in representing signals and images, such as wavelets [1], curvelets [2] and contourlets [3]. Alternatively, it can be learned adaptively from input examples, a task referred to as “dictionary learning” (DL). The latter approach has provided state-of-the-art results in classic image processing applications, such as denoising [4], inpainting [5], demosaicing [6], compression [7], [8] and more. Popular algorithms for dictionary learning are the MOD [9] and the K-SVD [10], which generalizes K-means clustering and learns an overcomplete dictionary that best sparsifies the input data.

Although successful in signal processing applications, the K-SVD algorithm “as-is” may not be suited for machine learning tasks such as classification or regression, as its primary goal is to achieve the best reconstruction of the input data, ignoring any discriminative information such as labels or annotations. Many suggestions have been made to extend DL to deal with labeled data. The SRC method by Wright et al. [11] achieved impressive results in face recognition by sparse coding each test sample over a dictionary containing the training samples from all classes, and choosing the class that yields the lowest reconstruction error. In [12], [13] Mairal et al. added a discriminative term to the DL model, and later incorporated the learning of the classifier parameters within the optimization of DL. The work reported in [14] by Zhang et al. was the first to incorporate the learning of the classifier parameters within the framework of the K-SVD algorithm. A similar extension was made in [15], [16] by Jiang et al., where in addition to the classifier parameters, another discriminative term for the sparse codes was added and optimized using the regular K-SVD. In [17] Yang et al. created an optimization function which forces both the learned dictionary and the resulting sparse coefficients to be discriminative. These algorithms and others that relate to them have been shown to be quite competitive with the best available learning algorithms, often leading to state-of-the-art results.

In machine learning, kernels have provided a straightforward way of extending a given algorithm to deal with nonlinearities. Prominent examples of such algorithms include kernel-SVM [18], kernel-PCA (KPCA) [19] and Kernel Fisher Discriminant (KFD) [20]. Suppose the original data can be mapped to a higher dimensional “feature space”, where tasks such as classification and regression are far easier. Under the proper conditions, the “kernel trick” allows one to train a learning algorithm in the higher-dimensional feature space without explicitly using the exact mapping. This can be done by posing the entire algorithm in terms of inner products between the input signals, and later replacing these inner products with kernels. One fundamental problem when using the kernel trick is that one is forced to access only the inner products of signals in feature space, instead of the signals themselves. A direct consequence of this is the need to store and manipulate a large kernel matrix $K$ of dimension $N \times N$ ($N$ being the size of the training set), which contains the modified inner products of all pairs of input examples.

In recent years, kernels have also been incorporated in the field of sparse representations, both in tasks of sparse coding [21]-[27] and dictionary learning [28], [29]-[33]. The starting point of this paper is the kernel DL method termed “Kernel K-SVD” (KKSVD) by Nguyen et al. The novelty in [28] is in the ability to fully pose the entire DL scheme in terms of kernels, using a uniquely structured dictionary which is a multiplication of two parts. The first, a constant matrix called the “base-dictionary”, contains all of the mapped signals in feature space, and the second, called the “coefficient-dictionary”, is the part actually updated during the learning process. The KKSVD suffers from the same issues arising when applying the kernel trick in general. Specifically, in large-scale datasets, where the number of input samples is of the order of thousands and beyond, the KKSVD quickly becomes impractical, both in runtime and in the required storage space.

While kernel sparse representation is becoming more common, the existing algorithms are still challenging to use, as they suffer from the problems mentioned above. The arena of linear DL, on the other hand, has a vast selection of existing tools that are implemented efficiently, enabling a dictionary to be learned quite rapidly in various settings, even if the number of training examples reaches the millions. Indeed, in such extreme cases, online learning becomes appealing [34].

As we show hereafter, our proposed method, “Linearized Kernel Dictionary Learning” (LKDL), enjoys the benefits of both worlds. LKDL is composed of two stages: kernel matrix approximation, followed by a linearization of the training process by the creation of “virtual samples” [35]. In the first stage, we apply the Nyström method to approximate the kernel matrix $K$, using a sub-sampled set of its columns. We explore and compare several such sub-sampling strategies, including core-sets, k-means, uniform, column-norm and diagonal sampling. Rather than using $K$ (or its approximation), we proceed with the assumption that it originates from a linear kernel, i.e. $K = F^T F$, and thus, instead of referring to $K$, we calculate the virtual samples $F$ using standard eigen-decomposition. After obtaining these virtual training and test sets, we apply an efficient off-the-shelf version of linear dictionary learning and continue with a standard classification scheme. This process essentially “linearizes” the kernel matrix and combines the nonlinear kernel information within that of the virtual samples.

We evaluate the performance of LKDL in three aspects: (1) first, we verify that the added nonlinearity in the form of the virtual datasets indeed improves classification results (relative to linear DL) and performs comparably to the exact kernelization performed in KKSVD; (2) we demonstrate the differences in runtime between the two methods; and (3) we show the ease of integrating LKDL with any existing DL algorithm, including supervised DL.

We should note that a shorter version of this paper has 
been submitted to NIPS 2015. This paper extends over that 
submission in several ways: (i) it broadens the survey of past 
work on supervised and kernel DL; (ii) it adds the combination 
of the proposed scheme with supervised DL, applied to two 
leading algorithms; and (iii) it expands the experimental results 
section substantially. 

This paper is organized as follows: Section II provides background to classical reconstructive DL with emphasis on the K-SVD and two methods of supervised DL, all of which are used later in the experimental part as the linear foundations over which our scheme is employed. Section III discusses Nguyen’s KKSVD algorithm for kernel DL and its complexity. Section IV presents the details of our proposed algorithm, LKDL, for kernel DL. This section also builds a wider picture of this field, by surveying the relevant literature on incorporating kernels into sparse coding and dictionary learning. Section V shows results corroborating the effectiveness of our method, and finally, Section VI concludes this paper and proposes future research directions.


II. Linear Dictionary Learning

This section provides background on classic reconstructive
DL, as well as two examples of discriminative, supervised DL.
The purpose of this section is to recall several key algorithms, 
the MOD and K-SVD, the FDDL, and the LC-KSVD, which 
we will kernelize in later sections. 

A. Background 

In sparse representations, given an input signal $x \in \mathbb{R}^p$ and a “dictionary” $D \in \mathbb{R}^{p \times m}$, one wishes to find a “sparse representation” vector $\gamma \in \mathbb{R}^m$ such that $x \approx D\gamma$. The dictionary $D = [d_1, \ldots, d_m]$ consists of “atoms” which faithfully represent the set of signals $x \in \mathcal{X}$. The task of finding a signal’s sparse representation is termed “sparse coding”¹ or “atom decomposition” and can be solved using the following optimization problem:

$$\hat{\gamma} = \arg\min_{\gamma} \|x - D\gamma\|_2 \quad \text{s.t.} \quad \|\gamma\|_0 \le q, \tag{1}$$

where $q$ is the number of nonzero coefficients in $\gamma$, often referred to as the “cardinality” of the representation, and the term $\|\gamma\|_0$ is the $\ell_0$-norm, which counts the number of nonzeros in $\gamma$. This problem is known to be NP-hard in general, implying that even for moderate $m$ (number of atoms), the amount of required computations becomes prohibitive. The group of algorithms which attempt to find an approximate solution to this problem are termed “pursuit algorithms”, and they can be roughly divided into two main approaches. The first are relaxation-based methods, such as “basis pursuit” [36], which relaxes the norm to be $\ell_1$ instead of $\ell_0$. The $\ell_1$-norm still promotes sparsity while making the optimization problem solvable with polynomial-time methods. The second family of algorithms used to approximate the solution of (1) are the greedy methods, such as “matching pursuit” [37], which find an approximation one atom at a time. In this paper we shall mostly address the latter group of pursuit algorithms, and more specifically, the Orthogonal Matching Pursuit (OMP) [38] algorithm, which is known to be efficient and easy to implement.
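The greedy pursuit described above can be sketched in a few lines. This is a minimal illustration rather than the implementation used in the paper: the function name `omp`, the stopping tolerance and the toy data are assumptions, and the atoms are assumed to have unit norm.

```python
import numpy as np

def omp(D, x, q):
    """Greedy sparse coding: approximate x with at most q atoms (columns) of D."""
    m = D.shape[1]
    support = []                      # indices of the atoms selected so far
    coeffs = np.zeros(0)
    residual = x.astype(float).copy()
    for _ in range(q):
        # Atom selection: pick the atom most correlated with the residual.
        correlations = np.abs(D.T @ residual)
        correlations[support] = 0.0   # never reselect an atom
        support.append(int(np.argmax(correlations)))
        # Least squares: refit all selected coefficients jointly.
        coeffs, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coeffs
        if np.linalg.norm(residual) < 1e-12:   # assumed tolerance
            break
    gamma = np.zeros(m)
    gamma[support] = coeffs
    return gamma

# Toy usage: orthonormal dictionary, signal built from two atoms.
D = np.eye(4)
x = np.array([0.0, 3.0, 0.0, -1.0])
gamma = omp(D, x, q=2)
```

The joint least-squares refit after every selection is what distinguishes OMP from plain matching pursuit, which only updates the newest coefficient.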

B. Classic Dictionary Learning 

In “dictionary learning” (DL), one attempts to compute the dictionary $D \in \mathbb{R}^{p \times m}$ that best sparsifies a set of examples, serving as the input data $X \in \mathbb{R}^{p \times N}$. A commonly used formulation for DL is the following optimization problem:

$$\arg\min_{D, \Gamma} \|X - D\Gamma\|_F^2 \quad \text{s.t.} \quad \forall\, 1 \le i \le N,\ \|\gamma_i\|_0 \le q, \tag{2}$$

where $\|\cdot\|_F$ is the Frobenius norm and $\Gamma = [\gamma_1, \ldots, \gamma_N] \in \mathbb{R}^{m \times N}$ is a matrix containing the sparse coefficient vectors of all the input signals. The problem of DL can be solved iteratively using a Block Coordinate Descent (BCD) approach of alternating between the sparse coding and dictionary update stages. Two such popular methods for DL are the MOD [9] and K-SVD [10].

¹The term “sparse coding” might be confusing because it is used in machine learning and brain research to describe the process we refer to as “dictionary learning”. In this paper we follow the terminology of signal and image processing, and thus “sparse coding” implies the quest for the sparse solution of an approximate linear system.

In MOD [9], once the sparse coefficients in iteration $t$, $\Gamma_t$, are calculated using a standard pursuit algorithm, the optimization problem becomes:

$$D_t = \arg\min_{D} \|X - D\Gamma_t\|_F^2. \tag{3}$$

This convex sub-problem leads to the analytical batch update of the dictionary using least squares:

$$D_t = X\Gamma_t^T(\Gamma_t\Gamma_t^T)^{-1} = X\Gamma_t^{\dagger}. \tag{4}$$

The problem with MOD is the need to compute the pseudo-inverse of the often very large $\Gamma$. The K-SVD algorithm by Aharon et al. [10] proposed alleviating this and speeding up the overall convergence by updating the dictionary one atom at a time. This amounts to a rank-1 approximation, computed via the standard SVD, for the update of each atom.
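The two dictionary update rules can be contrasted with a short sketch. The function names and the synthetic data below are illustrative assumptions; the K-SVD step follows the usual description of the algorithm (restrict the error matrix to the signals that use the atom, then take its leading singular pair).

```python
import numpy as np

def mod_update(X, Gamma):
    """MOD: least-squares dictionary update D = X Gamma^+ (pseudo-inverse)."""
    return X @ np.linalg.pinv(Gamma)

def ksvd_atom_update(X, D, Gamma, j):
    """K-SVD: refit atom j and its nonzero coefficients by a rank-1 SVD (in place)."""
    users = np.flatnonzero(Gamma[j])          # signals that currently use atom j
    if users.size == 0:
        return
    # Error without atom j's contribution, restricted to the signals using it.
    E = X[:, users] - D @ Gamma[:, users] + np.outer(D[:, j], Gamma[j, users])
    U, s, Vt = np.linalg.svd(E, full_matrices=False)
    D[:, j] = U[:, 0]                         # new unit-norm atom
    Gamma[j, users] = s[0] * Vt[0]            # matching coefficients

# Synthetic data: X is generated exactly by a known dictionary and sparse codes.
rng = np.random.default_rng(0)
D_true = rng.standard_normal((5, 3))
Gamma = rng.standard_normal((3, 20)) * (rng.random((3, 20)) < 0.6)
X = D_true @ Gamma
D_mod = mod_update(X, Gamma)
```

Because the K-SVD step keeps the support of each coefficient row fixed, the sparsity pattern found by the pursuit stage is preserved while the representation error cannot increase.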


C. Fisher Discriminant Dictionary Learning (FDDL) 

The work reported in [17] proposes an elegant way of performing discriminative DL for the purpose of classification between $L$ classes by modifying and extending the objective function posed in (2). A fundamental feature of this method is the assumption that the dictionary is divided into $L$ disjoint parts, each serving a different class.

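Let $X = [X_1, \ldots, X_L] \in \mathbb{R}^{p \times N}$ be the input examples of the $L$ classes, where $X_i \in \mathbb{R}^{p \times n_i}$ are the examples of class $i$. Denote by $D = [D_1, \ldots, D_L] \in \mathbb{R}^{p \times m}$ and $\Gamma = [\Gamma_1, \ldots, \Gamma_L] \in \mathbb{R}^{m \times N}$ the dictionary and the corresponding sparse representations. The part $\Gamma_i \in \mathbb{R}^{m \times n_i}$ can be further decomposed as follows: $\Gamma_i = [(\Gamma_i^1)^T, \ldots, (\Gamma_i^j)^T, \ldots, (\Gamma_i^L)^T]^T$, where $\Gamma_i^j \in \mathbb{R}^{m_j \times n_i}$ are the coefficients of the samples $X_i \in \mathbb{R}^{p \times n_i}$ over the dictionary $D_j \in \mathbb{R}^{p \times m_j}$. Armed with the above notations, we now turn to describe the objective function proposed in [14] for the discriminative DL task. This objective is composed of two parts. The first is based on the following expression:

$$r(X_i, D, \Gamma_i) = \|X_i - D\Gamma_i\|_F^2 + \|X_i - D_i\Gamma_i^i\|_F^2 + \sum_{j=1,\, j \ne i}^{L} \|D_j\Gamma_i^j\|_F^2. \tag{5}$$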

The first term demands a good representation of the $i$-th class samples using the whole dictionary, and the second term further demands a good representation for these examples using their own class’ sub-dictionary. The third term is of a different nature, forcing the $i$-th class examples to minimize their reliance on the other sub-dictionaries. Naturally, the overall penalty function will sum the expression in (5) for all the classes $i$.

We now turn to describe the second term in the objective function, which relies on the Fisher Discriminant Criterion. We define two scatter expressions, both applied to the representations. The first, $S_W(\Gamma)$, computes the within-class spread, while the second, $S_B(\Gamma)$, computes the scatter between the classes:

$$S_W(\Gamma) = \sum_{i=1}^{L} \sum_{\gamma_k \in \Gamma_i} (\gamma_k - \mu_i)(\gamma_k - \mu_i)^T, \qquad S_B(\Gamma) = \sum_{i=1}^{L} n_i (\mu_i - \mu)(\mu_i - \mu)^T, \tag{6}$$

where $\mu$ and $\mu_i$ are the mean vectors of the learned sparse coefficient vectors, $\Gamma$ and $\Gamma_i$ correspondingly. Naturally, we aim to minimize the first while maximizing the second.

The final FDDL model is defined by the following optimization expression:

$$J_{(D,\Gamma)} = \arg\min_{(D,\Gamma)} \left\{ \sum_{i=1}^{L} r(X_i, D, \Gamma_i) + \lambda_1\|\Gamma\|_1 + \lambda_2\left[\mathrm{tr}\left(S_W(\Gamma) - S_B(\Gamma)\right) + \eta\|\Gamma\|_F^2\right] \right\}. \tag{7}$$

The term $\|\Gamma\|_F^2$ serves as a regularization that ensures the convexity of (7).

The detailed optimization scheme of this rather complex expression is described in [17], along with two classification schemes, a global and a local one, depending on the size of the input dataset.
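To make the Fisher term concrete, the following sketch evaluates the two scatter matrices for a small set of codes. The helper name `fisher_scatter`, the use of an integer label vector, and the toy codes are assumptions made for illustration only.

```python
import numpy as np

def fisher_scatter(Gamma, labels):
    """Return (S_W, S_B) for sparse codes stored in the columns of Gamma."""
    m = Gamma.shape[0]
    mu = Gamma.mean(axis=1, keepdims=True)            # global mean code
    S_W = np.zeros((m, m))
    S_B = np.zeros((m, m))
    for c in np.unique(labels):
        G_c = Gamma[:, labels == c]                   # codes of class c
        mu_c = G_c.mean(axis=1, keepdims=True)        # class mean code
        S_W += (G_c - mu_c) @ (G_c - mu_c).T          # within-class spread
        S_B += G_c.shape[1] * (mu_c - mu) @ (mu_c - mu).T   # between-class scatter
    return S_W, S_B

# Two classes whose codes are identical within each class.
Gamma = np.array([[1.0, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 2.0, 2.0]])
labels = np.array([0, 0, 1, 1])
S_W, S_B = fisher_scatter(Gamma, labels)
fisher_term = np.trace(S_W - S_B)   # the quantity penalized in the objective
```

In this toy case the within-class scatter vanishes, so the trace term is as negative as the class separation allows, which is the behavior the discriminative penalty rewards.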


D. Label Consistent KSVD (LC-KSVD) 

In [15], [16], an alternative discriminative DL approach is introduced, in which the learning of the dictionary, along with the parameters of the classifier itself, is performed simultaneously, leading to the scheme termed “Label-Consistent K-SVD” (LC-KSVD). These elements are combined in one optimization objective, which is handled using the standard K-SVD algorithm.

In order to improve the performance of a linear classifier, an extra term is added to the reconstructive DL optimization function:

$$\arg\min_{D,T,\Gamma} \|X - D\Gamma\|_F^2 + \alpha\|Q - T\Gamma\|_F^2 \quad \text{s.t.} \quad \forall i,\ \|\gamma_i\|_0 \le q. \tag{8}$$

The second term encourages the sparse coefficients to be discriminative. More specifically, the matrix $Q = [q_1, \ldots, q_N] \in \mathbb{R}^{m \times N}$ is the “ideal” sparse-coefficient matrix for discrimination, where $q_i$ is a binary vector encoding the assignment of each example to its destination atoms. The matrix $T \in \mathbb{R}^{m \times m}$ transforms the sparse codes $\Gamma$ to their idealized versions in $Q$. This term thus promotes identical sparse codes for input signals from the same class and orthogonal sparse codes for signals from different classes.

In addition to the discriminative term added above, the authors in [16] propose learning the linear classifier within the framework of the DL. A linear predictive classifier is used of the form $f(\gamma; \Theta) = \Theta\gamma$, where $\Theta \in \mathbb{R}^{L \times m}$. The overall objective function suggested is:

$$\arg\min_{D,\Theta,T,\Gamma} \left\{ \|X - D\Gamma\|_F^2 + \alpha\|Q - T\Gamma\|_F^2 + \beta\|H - \Theta\Gamma\|_F^2 \right\} \quad \text{s.t.} \quad \forall i,\ \|\gamma_i\|_0 \le q, \tag{9}$$

where the classification error is represented by the term $\|H - \Theta\Gamma\|_F^2$, $\Theta$ contains the classifier parameters, and $H = [h_1, \ldots, h_N] \in \mathbb{R}^{L \times N}$ is the label matrix of all input examples, in which the vector $h_i = [0, 0, \ldots, 0, 1, 0, \ldots, 0]^T$ contains only zeros apart from the index corresponding to the class of the example. The optimization function in (9) can also be written as follows:

$$\arg\min_{D_{new},\Gamma} \|X_{new} - D_{new}\Gamma\|_F^2 \quad \text{s.t.} \quad \forall i,\ \|\gamma_i\|_0 \le q, \tag{10}$$

where $X_{new} = (X^T, \sqrt{\alpha}Q^T, \sqrt{\beta}H^T)^T \in \mathbb{R}^{(p+m+L) \times N}$ and $D_{new} = (D^T, \sqrt{\alpha}T^T, \sqrt{\beta}\Theta^T)^T \in \mathbb{R}^{(p+m+L) \times m}$. The unified columns in $D_{new}$ are all normalized to unit $\ell_2$ norm. The optimization objective in (10) can be solved using standard DL algorithms, such as K-SVD.

The authors propose two variants of LC-KSVD: LC-KSVD2, in which the parameters of the classifier are learned along with the dictionary, as shown in (9), and LC-KSVD1, in which they are calculated separately by $\Theta^T = (\Gamma\Gamma^T + \lambda_2 I)^{-1}\Gamma H^T$. More details on these expressions and the numerical scheme for minimizing the objective function can be found in [15], [16]. A new sample $x$ is classified by first sparse coding it over the dictionary $D$, and then applying the classifier $\Theta$ to estimate the label $j$.
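The stacking trick that turns the three-term objective into a standard DL problem, and the separately computed classifier, can be sketched as follows. All function names are hypothetical, `lam` plays the role of the regularization weight in the closed-form classifier, and taking the arg-max of the classifier response as the predicted label is an assumption about the decision rule.

```python
import numpy as np

def stack_lcksvd(X, Q, H, D, T, Theta, alpha, beta):
    """Stack data and dictionary so one reconstruction term covers all three penalties."""
    X_new = np.vstack([X, np.sqrt(alpha) * Q, np.sqrt(beta) * H])
    D_new = np.vstack([D, np.sqrt(alpha) * T, np.sqrt(beta) * Theta])
    # Normalize the unified columns to unit l2 norm.
    D_new = D_new / np.linalg.norm(D_new, axis=0, keepdims=True)
    return X_new, D_new

def ridge_classifier(Gamma, H, lam):
    """Regularized least-squares fit of Theta (L x m) so that Theta @ Gamma ~ H."""
    m = Gamma.shape[0]
    return np.linalg.solve(Gamma @ Gamma.T + lam * np.eye(m), Gamma @ H.T).T

def classify(Theta, gamma):
    """Predicted label: index of the largest classifier response (assumed rule)."""
    return int(np.argmax(Theta @ gamma))

# Toy codes: two atoms, two classes, each class uses its own atom.
Gamma = np.array([[1.0, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 1.0]])
H = np.array([[1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0]])
Theta = ridge_classifier(Gamma, H, lam=1.0)
```

After the stacked problem is solved, the rows of the learned stacked dictionary would be split back into the dictionary, the transform and the classifier; that step is omitted here.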

III. Kernel Dictionary Learning 

This section focuses on kernel sparse representations, with emphasis on the kernel K-SVD method by Nguyen et al., which we will compare against later in this paper.

A. Kernels - The Basics 

In machine learning, it is well known that a non-linear mapping of the signal of interest to a higher dimension may improve its discriminability in tasks such as classification. Let $x \in \mathbb{R}^p$ be a signal in input space, which is embedded to a higher dimensional space $\mathcal{F}$ using the mapping $\Phi$: $x \in \mathbb{R}^p \rightarrow \Phi(x) \in \mathbb{R}^P$ ($P \gg p$, and it might even be infinite). The space in which this new signal $\Phi(x)$ lies is called the “feature space”. The next step in machine learning algorithms, in particular in classification, would be learning a classifier based on the mapped input signals and labels. This task can be prohibitive if tackled directly. A way around this hurdle is the “kernel trick” [?], [?], which allows computing inner products between pairs of signals in the feature space, using a simple nonlinear function operating on the two signals in input space:

$$\kappa(x, x') = \langle\Phi(x), \Phi(x')\rangle = \Phi(x)^T\Phi(x'), \tag{11}$$

where $\kappa$ is the “kernel”. This relation holds true for positive-semi-definite (p.s.d.) and Mercer kernels [?]. Thus, suppose that the learning algorithm can be fully posed in terms of inner products. In such a case, one can achieve a “kernelized” version by swapping the inner products with the kernel function, without ever operating in the feature space.

In case there are $N$ input signals $X = [x_1, \ldots, x_N] \in \mathbb{R}^{p \times N}$, the “kernel matrix” $K \in \mathbb{R}^{N \times N}$ holds the kernel values of all pairs of input signals:

$$K_{ij} = \kappa(x_i, x_j) = \langle\Phi(x_i), \Phi(x_j)\rangle, \quad \forall i, j = 1..N. \tag{12}$$
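As a concrete illustration of the kernel matrix, the sketch below builds it for two commonly used kernels. The Gaussian and polynomial kernels, their parameter names (`sigma`, `degree`, `c`) and the random data are assumptions chosen for the example; the text above does not commit to a specific kernel.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian kernel values between the columns of A (p x n) and B (p x m)."""
    sq_dist = (A**2).sum(0)[:, None] + (B**2).sum(0)[None, :] - 2.0 * A.T @ B
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma**2))

def poly_kernel(A, B, degree=2, c=1.0):
    """Polynomial kernel (a^T b + c)^degree between the columns of A and B."""
    return (A.T @ B + c) ** degree

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 5))      # p = 3 features, N = 5 signals
K = rbf_kernel(X, X)                 # N x N kernel matrix of all signal pairs
```

For the homogeneous quadratic kernel on two-dimensional inputs, the feature map can be written explicitly as $\Phi(x) = [x_1^2, \sqrt{2}x_1x_2, x_2^2]^T$, so the kernel value equals an ordinary inner product of the mapped signals; for the Gaussian kernel the feature space is infinite-dimensional and only the kernel values are accessible.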


An inherent constraint in kernel algorithms is the fact that the solution vectors, for example the principal components in KPCA, are expansions of the mapped signals in feature space:

$$v = \sum_{i=1}^{N} \alpha_i \Phi(x_i). \tag{13}$$

The subspace in which the possible solutions lie can be viewed as an $N$-dimensional surface residing in $\mathbb{R}^P$ [?]. Motivated by the inability to directly approach the mapped signals in feature space, researchers have suggested embedding the $N$-dimensional surface to a finite Euclidean subspace, where all geometrical properties, such as distances and angles between pairs of $\Phi(x_i)$’s, are preserved [?]. The embedding is called the “kernel empirical map” and the resulting subspace is referred to as the “empirical feature subspace”. One way to embed a given signal $x$ to the empirical feature space is by calculating kernel values originating from inner products with all input training examples:

$$x \rightarrow [\kappa(x, x_1), \ldots, \kappa(x, x_N)]^T. \tag{14}$$

B. Kernel Dictionary Learning 

A straightforward way to kernelize dictionary learning would be exchanging all the signals (and dictionary atoms) with their respective representations in feature space, $x \rightarrow \Phi(x)$, $d \rightarrow \Phi(d)$, and rephrasing the algorithm such that it contains solely inner products between pairs of these ingredients. A difficulty with this approach is that during the learning process, the dictionary atoms are in feature space. As there is no exact reverse mapping from the updated inner products to their corresponding signals in input space, there is no direct way of accessing the updated dictionary atoms, as practiced in linear DL.

In order to solve this problem, the authors in [28] suggest decomposing the dictionary in feature space into $\Phi(D) = \Phi(X)A$, where $\Phi(X)$ is the constant part, called the “base-dictionary”, which consists of all mapped input signals, and $A$ is the only part updated during the learning, called the “coefficient-dictionary”. Just like in the case of the KPCA [19], the obtained dictionary is limited to an $N$-dimensional manifold in the feature space.

The kernel dictionary learning can now be formulated as the following optimization problem:

$$\arg\min_{A,\Gamma} \|\Phi(X) - \Phi(X)A\Gamma\|_F^2 \quad \text{s.t.} \quad \forall i = 1..N,\ \|\gamma_i\|_0 \le q. \tag{15}$$

Similarly to linear DL, this optimization problem can be solved iteratively by first performing sparse coding with a fixed dictionary $A$, then updating the dictionary according to the computed sparse representations $\Gamma$, and so on, until convergence is reached. The kernelized equivalent of sparse coding is given by:

$$\arg\min_{\gamma} \|\Phi(z) - \Phi(X)A\gamma\|_2 \quad \text{s.t.} \quad \|\gamma\|_0 \le q, \tag{16}$$

where $z$ is the input signal. As mentioned earlier, the sparse coding algorithm we focus on in this paper, as well as in Nguyen’s KKSVD [28], is the OMP [38] and its kernel version, KOMP [28]. Table I presents two of the main stages in the OMP algorithm, which are the Atom-Selection (AS) and Least-Squares (LS) stages, and their kernelized versions. As can be seen, these stages can be completely represented using the coefficient dictionary $A$, the sparse representation vector $\gamma$ and the kernel functions $K(X, X) \in \mathbb{R}^{N \times N}$ and $K(z, X) = [\kappa(z, x_1), \ldots, \kappa(z, x_N)] \in \mathbb{R}^{1 \times N}$.

The dictionary update stage can also be kernelized. In the MOD algorithm [9], the update of $A$ in iteration $t+1$ is given by $A_{t+1} = \Gamma_t^{\dagger} = \Gamma_t^T(\Gamma_t\Gamma_t^T)^{-1}$, being the solution to $\arg\min_{A} \|\Phi(X) - \Phi(X)A\Gamma_t\|_F^2$. A similar update can be derived for the K-SVD algorithm, as described in depth in [28], [?].
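The kernelized pursuit can be sketched directly from the atom-selection and least-squares expressions of Table I. This is an illustrative sketch, not Nguyen's reference code: the function name `komp` is an assumption, the atoms $\Phi(X)a_j$ are not renormalized, and a least-squares solver is used in place of an explicit inverse.

```python
import numpy as np

def komp(K_XX, k_zX, A, q):
    """Kernel OMP: code Phi(z) over the dictionary Phi(X) A using kernel values only.

    K_XX: N x N kernel matrix, k_zX: length-N vector of kappa(z, x_i),
    A: N x m coefficient dictionary, q: maximal number of selected atoms.
    """
    m = A.shape[1]
    support, coeffs = [], np.zeros(0)
    for _ in range(q):
        # Atom selection: <residual, Phi(X) a_j> for every atom j.
        corr = np.abs((k_zX - coeffs @ A[:, support].T @ K_XX) @ A)
        corr[support] = 0.0                       # never reselect an atom
        support.append(int(np.argmax(corr)))
        # Least squares over the selected atoms, posed through the kernel matrix.
        A_s = A[:, support]
        coeffs, *_ = np.linalg.lstsq(A_s.T @ K_XX @ A_s, A_s.T @ k_zX, rcond=None)
    gamma = np.zeros(m)
    gamma[support] = coeffs
    return gamma
```

With a linear kernel, where the kernel matrix is $X^TX$ and the kernel vector is $X^Tz$, this procedure reduces to ordinary OMP over the dictionary $XA$, which gives a convenient sanity check. Note that every iteration touches the full $N \times N$ kernel matrix.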
C. Difficulties in KDL 

There are a few difficulties that arise when dealing with kernels, and specifically in kernel dictionary learning. In the input space, a signal $x \in \mathbb{R}^p$ can be described using its own $p$ features, while in feature space it is described by its relationship with all of the other $N$ input signals. The runtime and memory complexity of a kernel learning algorithm changes accordingly and depends on the number of input signals, instead of on the dimension of the signals. This observation is also true for Nguyen’s KDL, where the kernel matrix $K$ is used during the sparse coding and dictionary update stages, and must be stored in full. In applications where the number of input samples is large, this dependency on the kernel matrix becomes prohibitive. In Table I one can see the complexity of the main stages in the KOMP algorithm and compare it to the linear OMP version. It is clear that both the atom-selection and the least-squares stages scale quadratically with the size of the input dataset.

Another inherent difficulty in kernel methods is the need 
to tailor each algorithm such that it is formulated solely 
through inner products. This constraint creates complex and 
cumbersome expressions and is not always possible, as some 
steps in the algorithm may contain a mixture of the signals 
and their mapped version. 

IV. The Proposed Algorithm 

Sections II and III gave some background to the task we
address in this paper. We saw that kernelization of the DL
task can be beneficial, but unfortunately, we also identified
key difficulties that accompany this process. In this work
we aim to propose a systematic and simple path for kernelizing 
existing dictionary learning algorithms, in a way that will 
avoid the problems mentioned above. More specifically, we 
desire to be able to kernelize any existing DL algorithm, be 
it unsupervised or supervised, and do so while being able to 
work on massive training sets without the need to compute, 
store, or manipulate the kernel matrix K. In this section 
we outline such a solution, by carefully describing its key 
ingredients. 

A. Kernel matrix approximation 

Let $X \in \mathbb{R}^{p \times N}$ be the input signals and $K \in \mathbb{R}^{N \times N}$ their corresponding kernel matrix. We shall further assume that $K$ is of rank $r < N$. As long as the kernel satisfies Mercer’s conditions of positive-semi-definiteness it can be written as an inner product between mapped signals in feature space: $K_{ij} = \langle\Phi(x_i), \Phi(x_j)\rangle$. Assume, for the sake of the discussion here, that the kernel function applies a simple inner product, i.e. $K_{ij} = \langle f_i, f_j\rangle = f_i^T f_j$, where $f_i, f_j$ are the feature vectors corresponding to $x_i$ and $x_j$, respectively. Thus, the kernel matrix would have the form $K = F^T F = \Phi(X)^T\Phi(X)$, where $F$ is a matrix of size $r \times N$ ($r$ is the feature-space dimension, and we have assumed that it is smaller than $N$). One can refer to the vectors in $F$ as “virtual samples” [35]. This way, instead of learning using the kernel matrix $K$, one could work on these virtual samples directly using a linear learning algorithm, leading to the same outcome. In the following, we will leverage this insight.

The kernel matrix is generally symmetric and positive-semi-definite, and as such can be decomposed using eigen-decomposition as follows: $K = U\Lambda U^T$, where $\Lambda \in \mathbb{R}^{r \times r}$ is a diagonal matrix containing all of the nonzero eigenvalues of $K$ in descending order and $U \in \mathbb{R}^{N \times r}$ contains the matching orthonormal eigenvectors. An approximation of the virtual samples can be achieved by:

$$F = \Lambda^{1/2}U^T = \Lambda^{-1/2}U^T K. \tag{17}$$

The virtual samples can be viewed as a mapping of the original input signals to an $r$-dimensional empirical feature space:

$$x \rightarrow \Lambda^{-1/2}U^T\left(\kappa(x, x_1), \kappa(x, x_2), \ldots, \kappa(x, x_N)\right)^T. \tag{18}$$

An approximated kernel empirical map of dimension $k < r$ can also be obtained by considering only the top $k$ eigenvalues and corresponding eigenvectors: $F_k = \Lambda_k^{1/2}(U_k)^T$.

This “linearization” is the mediator between kernel DL, which is obligated to store and manipulate the kernel matrix $K$, and linear DL, which can deal with very large datasets. The decomposition of the matrix $K$ to its eigenvalues and eigenvectors is a demanding task in itself, both in time, $O(N^2k)$, and in space, $O(N^2)$. Next we will show how a good approximation of the matrix $K$ can be constructed with only a subset of its columns, using the popular Nyström method.
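The construction of virtual samples from the exact kernel matrix can be sketched as follows. The function name `virtual_samples`, the relative threshold `tol` used to decide which eigenvalues count as nonzero, and the toy linear kernel are assumptions for illustration.

```python
import numpy as np

def virtual_samples(K, k=None, tol=1e-10):
    """Return F (rows = retained eigen-directions) such that F.T @ F approximates K."""
    eigvals, eigvecs = np.linalg.eigh(K)              # symmetric solver, ascending order
    order = np.argsort(eigvals)[::-1]                 # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > tol * eigvals[0]                 # drop numerically zero eigenvalues
    if k is not None:
        keep[k:] = False                              # keep only the top k directions
    Lam, U = eigvals[keep], eigvecs[:, keep]
    return np.sqrt(Lam)[:, None] * U.T                # F = Lambda^{1/2} U^T

# Toy check on a rank-3 kernel matrix (linear kernel of 3-dimensional signals).
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 6))
K = X.T @ X
F = virtual_samples(K)        # 3 x 6 virtual samples
F_2 = virtual_samples(K, k=2) # rank-2 approximation
```

Any linear DL algorithm can then be run on the columns of the returned matrix in place of the original signals. The full eigendecomposition used here is exactly the step whose cost motivates the column-sampling approximation discussed next.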


B. Nyström method

A common necessity in many algorithms in signal processing and machine learning is deriving a relatively accurate and efficient approximation of a large matrix. An attractive method that has gained popularity in recent years is the Nyström method [44]-[46], which generates a low-rank approximation using a subset of the input data. The original Nyström method, first introduced by Williams and Seeger [44], proposed using uniform sampling without replacement.

Let $K \in \mathbb{R}^{N \times N}$ be a symmetric positive semi-definite matrix, and in particular for the discussion here, a kernel matrix. Suppose $c < N$ columns from the matrix $K$ are sampled uniformly without replacement to form the reduced matrix $C \in \mathbb{R}^{N \times c}$. Without loss of generality, the matrices $C$ and $K$ can be permuted as follows:

$$C = \begin{bmatrix} W \\ S \end{bmatrix} \quad \text{and} \quad K = \begin{bmatrix} W & S^T \\ S & B \end{bmatrix}, \tag{19}$$

where $W \in \mathbb{R}^{c \times c}$ is the kernel matrix of the intersection of the chosen $c$ columns with $c$ rows, $B \in \mathbb{R}^{(N-c) \times (N-c)}$ is the kernel matrix composed of the $N - c$ remaining rows and columns, and $S \in \mathbb{R}^{(N-c) \times c}$ is a mixture of both.

TABLE I
Complexity of the atom selection (AS) and the least squares (LS) stages in linear and kernel OMP. $I_s$ is the current support vector and $|I_s|$ its length; $D_s$, $A_s$ and $\gamma_s$ are the sub-matrices of $D$, $A$ and $\gamma$, respectively, corresponding to $I_s$; $r_t$ is the residual.

Stage        | Term                                                                                                  | Complexity
OMP-AS       | $\langle r_t, d_j\rangle = \langle z - D_s\gamma_s, d_j\rangle = z^T d_j - \gamma_s^T D_s^T d_j$     | $O(p|I_s| + p)$
KOMP-AS [28] | $K(z, X)a_j - \gamma_s^T A_s^T K(X, X)a_j$                                                            | $O(N^2 + |I_s|N + N)$
OMP-LS       | $\gamma_s = (D_s^T D_s)^{-1}D_s^T z$                                                                  | $O(p|I_s|^2 + p|I_s| + |I_s|^3)$
KOMP-LS [28] | $\gamma_s = [A_s^T K(X, X)A_s]^{-1}(K(z, X)A_s)^T$                                                    | $O(N^2|I_s| + N|I_s| + |I_s|^3)$

The Nyström method uses both $C$ and $W$ to construct an approximation of the matrix $K$ as follows:

$$K \approx CW^{\dagger}C^T, \tag{20}$$

where $(\cdot)^{\dagger}$ denotes the pseudo-inverse. The symmetric matrix $W$ can also be posed in terms of eigenvalues and eigenvectors: $W = V\Sigma V^T$, where $\Sigma$ is a diagonal matrix containing the eigenvalues of $W$ in descending order and $V$ contains the matching orthonormal eigenvectors. The pseudo-inverse of $W$ is given by $W^{\dagger} = V\Sigma^{\dagger}V^T$. The expression of $(W^{\dagger})^{1/2}$ can be similarly derived: $(W^{\dagger})^{1/2} = (\Sigma^{\dagger})^{1/2}V^T$.

We can represent $K$ as before, using linear inner products of the virtual samples, and plug in Nyström’s approximation:

$$K \approx F^T F = CW^{\dagger}C^T = CV\Sigma^{\dagger}V^T C^T, \tag{21}$$

and derive the final expression of the virtual samples by:

$$F = (\Sigma^{\dagger})^{1/2}V^T C^T. \tag{22}$$

The rank-$k$ ($k < c$) approximation can similarly be derived:

$$F_k = (\Sigma_k^{\dagger})^{1/2}V_k^T C^T, \tag{23}$$

where $\Sigma_k = \mathrm{diag}(\sigma_1, \ldots, \sigma_k) \in \mathbb{R}^{k \times k}$ contains the $k$ largest eigenvalues of $W$ and $V_k \in \mathbb{R}^{c \times k}$ the corresponding orthonormal eigenvectors.

After performing the Nyström approximation, the space complexity of kernel DL reduces from $O(N^2)$ to $O(Nc)$, the size of the matrix $C$, which is used during the computation of the virtual samples. The time complexity of the Nyström method is $O(Nck + c^2k)$, where $O(Nck)$ represents the multiplication of $(\Sigma_k^{\dagger})^{1/2}V_k^T C^T$ and $O(c^2k)$ stands for the eigenvalue decomposition (and inversion) of the reduced matrix $W$.

Note that the process of computing the virtual samples may seem inefficient, but it is performed only once, after which the complexity of the DL is dictated by the chosen algorithm, and not by the “kernelization”. In addition, in scenarios where the number of input examples is very large, the ratio $c/N$ in Nyström’s method can be reduced greatly, i.e. $c \ll N$, making the approximation even less dominant in terms of runtime and memory, while retaining almost the same accuracy.
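The Nyström-based virtual samples can be sketched without ever forming the full kernel matrix. The function name `nystrom_virtual_samples`, the threshold `tol`, and the linear-kernel toy data are assumptions; a linear kernel of low rank is used only because the Nyström approximation is then exact, which makes the result easy to check.

```python
import numpy as np

def nystrom_virtual_samples(C, W, k=None, tol=1e-10):
    """Virtual samples F such that F.T @ F equals C W^+ C^T (optionally rank k)."""
    sigma, V = np.linalg.eigh(W)                  # W is symmetric, ascending order
    order = np.argsort(sigma)[::-1]
    sigma, V = sigma[order], V[:, order]
    keep = sigma > tol * sigma[0]                 # pseudo-inverse: skip zero eigenvalues
    if k is not None:
        keep[k:] = False
    sigma, V = sigma[keep], V[:, keep]
    return (V / np.sqrt(sigma)).T @ C.T           # (Sigma^+)^{1/2} V^T C^T

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 50))                  # N = 50 signals, linear kernel of rank 3
idx = rng.choice(50, size=10, replace=False)      # uniform sampling without replacement
C = X.T @ X[:, idx]                               # N x c sampled columns of K
W = C[idx]                                        # c x c intersection block
F = nystrom_virtual_samples(C, W)
```

Only the $N \times c$ matrix of sampled columns is stored; for a general kernel the two matrices would be filled with kernel evaluations between all signals and the sampled ones.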


C. Sampling Techniques 

Since the Nystrom method creates an approximation of a 
large symmetric matrix based on a subset of its columns, 
the chosen sampling scheme plays an important part. The 
basic method proposed originally by Williams and Seeger 
was uniform sampling without replacement [44]. The columns
of the Gram matrix can be alternatively sampled from a 
nonuniform distribution. Two such examples of nonuniform 
sampling include “column-norm sampling” El, where the 
weight of the ith column $k_i$ is its $\ell_2$ norm: $p_i = \|k_i\|_2^2 / \|K\|_F^2$,
and “diagonal sampling” HSl where the weight is proportional
to the corresponding diagonal element of K.

These nonuniform methods are more sophisticated than uniform
sampling but require additional complexity: $O(N)$ in time and
space for diagonal sampling and $O(N^2)$ for column-norm
sampling. A comprehensive theoretical and empirical comparison
of these three methods is provided in [49].
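
The three sampling distributions can be sketched as follows. This is an illustration only: the function names are hypothetical, and the diagonal weights are taken to be directly proportional to $K_{ii}$, which is an assumption about the exact normalization. Note that column-norm sampling needs the full matrix K, which is the source of its $O(N^2)$ cost.

```python
import numpy as np


def sampling_probabilities(K: np.ndarray, method: str = "uniform") -> np.ndarray:
    """Probability of picking each column of the symmetric Gram matrix K."""
    N = K.shape[0]
    if method == "uniform":
        return np.full(N, 1.0 / N)
    if method == "column_norm":
        # p_i = ||k_i||_2^2 / ||K||_F^2
        return np.sum(K**2, axis=0) / np.sum(K**2)
    if method == "diagonal":
        # Assumption: weight directly proportional to the diagonal entry K_ii.
        d = np.diag(K).astype(float)
        return d / d.sum()
    raise ValueError(f"unknown sampling method: {method}")


def sample_columns(K, c, method="uniform", seed=0):
    """Draw c distinct column indices according to the chosen distribution."""
    rng = np.random.default_rng(seed)
    p = sampling_probabilities(K, method)
    return rng.choice(K.shape[0], size=c, replace=False, p=p)


# Toy data: 50 unit-norm signals of dimension 8, polynomial kernel of order 2.
rng = np.random.default_rng(0)
X = rng.standard_normal((8, 50))
X /= np.linalg.norm(X, axis=0)
K_demo = (X.T @ X) ** 2
idx = sample_columns(K_demo, c=10, method="column_norm")
C, W = K_demo[:, idx], K_demo[np.ix_(idx, idx)]
```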

In [50], Zhang et al. suggested an alternative approach
of selecting a few “representative” columns in K by first
performing K-means clustering, then computing the reduced
matrix C based on these so-called “cluster centers”. Denote by
$X_R$ the resulting c cluster centers, created from the original
data X. The computation of the kernel matrices C and W
would be: $C = K(X, X_R)$ and $W = K(X_R, X_R)$. Zhang et
al. also show that the combination of k-means clustering with
the Nystrom method minimizes the approximation error.
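
A minimal sketch of this landmark selection is given below. It uses a plain Lloyd's k-means written in NumPy rather than any particular fast k-means code, and a polynomial kernel of order 2 as an arbitrary example; `kmeans_centers` and `poly_kernel` are hypothetical helper names. Because the centers are generally not training samples, C and W are evaluated with the kernel function instead of being read off as columns of K.

```python
import numpy as np


def kmeans_centers(X, c, n_iter=20, seed=0):
    """Plain Lloyd's k-means on the columns of X (p x N); returns p x c centers."""
    rng = np.random.default_rng(seed)
    centers = X[:, rng.choice(X.shape[1], size=c, replace=False)].copy()
    for _ in range(n_iter):
        # Squared distances between every sample and every center (N x c).
        d2 = ((X[:, :, None] - centers[:, None, :]) ** 2).sum(axis=0)
        labels = d2.argmin(axis=1)
        for j in range(c):
            members = X[:, labels == j]
            if members.shape[1] > 0:   # keep the old center if a cluster empties
                centers[:, j] = members.mean(axis=1)
    return centers


def poly_kernel(A, B, degree=2):
    """kappa(a, b) = (a^T b)^degree for all column pairs of A and B."""
    return (A.T @ B) ** degree


rng = np.random.default_rng(0)
X = rng.standard_normal((5, 200))
X /= np.linalg.norm(X, axis=0)

X_R = kmeans_centers(X, c=20)      # c cluster centers
C = poly_kernel(X, X_R)            # C = K(X, X_R), N x c
W = poly_kernel(X_R, X_R)          # W = K(X_R, X_R), c x c
K_hat = C @ np.linalg.pinv(W) @ C.T
```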

Another appealing sampling technique has been suggested 
in the context of coresets ED. The idea is to sample the given 
data by emphasizing unique samples that are ill-represented 
by the others. In the context of our problem, we sample c 
signals from X according to the following distribution:
$p_i = \mathrm{err}(x_i, \mu) / \sum_{x_j \in X} \mathrm{err}(x_j, \mu)$,
where $\mathrm{err}(x_i, \mu) = \|x_i - \mu\|_2^2$ is the
representation error of the signal $x_i$, corresponding
to the mean of all training signals $\mu = (1/N) \sum_{i} x_i$.
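
The distribution above can be computed in a few lines. The sketch below is illustrative (the function name and the toy data are assumptions); it shows that a sample lying far from the mean receives the largest selection probability.

```python
import numpy as np


def coreset_probabilities(X: np.ndarray) -> np.ndarray:
    """p_i proportional to ||x_i - mu||_2^2, with mu the mean of the columns of X."""
    mu = X.mean(axis=1, keepdims=True)
    err = np.sum((X - mu) ** 2, axis=0)
    return err / err.sum()


rng = np.random.default_rng(0)
X = 0.1 * rng.standard_normal((4, 100))
X[:, 7] += 5.0                      # one sample that the mean represents poorly

p = coreset_probabilities(X)
idx = rng.choice(X.shape[1], size=10, replace=False, p=p)   # c = 10 signals
X_R = X[:, idx]
```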

D. Linearized Kernel Dictionary Learning (LKDL) 

Let $X_{train}$ be a labeled training set, arranged as a
structure in L categories: $X_{train} = [X_1, \dots, X_L] \in \mathbb{R}^{p \times N}$,
where $X_i$ contains the training samples that belong to the
ith class and $N = \sum_{i=1}^{L} n_i$. Our process of kernel
dictionary learning is divided in two parts: the first, a
pre-processing stage that creates new virtual training and
test samples, followed by a second stage of applying a
standard DL. This whole process is termed “Linearized Kernel
Dictionary Learning” (LKDL).

^We consider here the case of labeled data, but the labels can be omitted, 
thus reducing to the simple representative DL format. 
The pre-processing stage is shown in Algorithm 1. First,
the initial training set $X_{train}$ is sampled using one of the
techniques mentioned in Section IV-C, creating the reduced set
$X_R = [x_{R_1}, \dots, x_{R_c}] \in \mathbb{R}^{p \times c}$. Then the
matrix $C \in \mathbb{R}^{N \times c}$ in Nystrom’s method is
calculated by simply applying the chosen kernel on each and
every pair of columns in $X_{train}$ and $X_R$. Next, the reduced
matrix $W \in \mathbb{R}^{c \times c}$ is both calculated and later
on inverted using rank-k eigen-decomposition. Finally the
virtual training samples $F_{train} \in \mathbb{R}^{k \times N}$ are
calculated using equation (23). The Nystrom method permits
approximating a new test vector $f_{test}$ using equation (24),
by using the mapping already calculated based on the training
set, and multiplying by the joint kernel vector of the sampled
set $X_R$ and the current test sample, $K(X_R, x_{test})$:

$$f_{test} = (S_k^{\dagger})^{1/2} V_k^{T} K(X_R, x_{test}). \qquad (24)$$

Once the training and test sets are represented as virtual
samples $F_{train}$ and $F_{test}$, any linear DL-based
classification method can be implemented. In the context of
classification we follow Nguyen’s “distributive” approach lIMl
of learning L separate dictionaries $[D_1, \dots, D_L]$, one per
class, then classifying each test sample by first computing its
sparse coefficient vector $\gamma_i$ over each of the
dictionaries $D_i$ and finally choosing the class corresponding
to the smallest reconstruction error:

$$r_i = \|f_{test} - D_i \gamma_i\|_2^2, \quad \forall i = 1, \dots, L. \qquad (25)$$
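
The decision rule in (25) can be sketched as below. This is not the batch-OMP implementation used in the experiments: `omp` is a simple greedy orthogonal matching pursuit written for illustration, `classify` is a hypothetical helper, and the two toy dictionaries are built to live in disjoint subspaces so that the outcome is easy to verify.

```python
import numpy as np


def omp(D, f, sparsity):
    """Greedy orthogonal matching pursuit; D is assumed to have unit-norm columns."""
    residual, support = f.copy(), []
    coef = np.zeros(0)
    for _ in range(sparsity):
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j in support:          # no new atom improves the fit
            break
        support.append(j)
        coef, *_ = np.linalg.lstsq(D[:, support], f, rcond=None)
        residual = f - D[:, support] @ coef
    gamma = np.zeros(D.shape[1])
    gamma[support] = coef
    return gamma


def classify(f_test, dictionaries, sparsity):
    """Return the index of the dictionary with the smallest residual r_i."""
    residuals = [
        float(np.sum((f_test - D @ omp(D, f_test, sparsity)) ** 2))
        for D in dictionaries
    ]
    return int(np.argmin(residuals)), residuals


rng = np.random.default_rng(0)


def random_dictionary(rows, dim=6, n_atoms=5):
    """Unit-norm atoms supported only on the given coordinates (toy data)."""
    D = np.zeros((dim, n_atoms))
    D[rows] = rng.standard_normal((len(rows), n_atoms))
    return D / np.linalg.norm(D, axis=0)


D_list = [random_dictionary([0, 1, 2]), random_dictionary([3, 4, 5])]
f_test = D_list[1][:, [0, 2]] @ np.array([1.0, -0.5])   # built from class-1 atoms
label, residuals = classify(f_test, D_list, sparsity=3)
```

In practice any sparse coding routine can replace `omp`; only the comparison of the per-class residuals is specific to the classification scheme.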


Algorithm 1 LKDL Pre-Processing
1: Input: $X_{train} = [X_1, \dots, X_L]$, $X_{test}$, the kernel $\kappa$,
   smp_method, c, k
2: $X_R$ = sub_sample($X_{train}$, smp_method, c)
3: Compute $C_{train} = K(X_{train}, X_R)$
4: Compute $W = K(X_R, X_R)$
5: Approximate $W_k$ using k largest eigenvalues and
   eigenvectors: $W_k = V_k S_k V_k^{T}$
6: Compute virtual train set $F_{train} = (S_k^{\dagger})^{1/2} V_k^{T} C_{train}^{T}$
7: Compute $C_{test} = K(X_{test}, X_R)$
8: Compute virtual test set $F_{test} = (S_k^{\dagger})^{1/2} V_k^{T} C_{test}^{T}$
9: Output: $F_{train} = [F_1, \dots, F_L]$, $F_{test}$
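
A compact NumPy sketch of Algorithm 1 follows. It is an illustration under stated assumptions rather than the reference implementation: uniform sampling stands in for `smp_method`, the polynomial kernel is one possible choice of $\kappa$, the relative threshold `tol` used to form the pseudo-inverse is our own choice, and the function names are hypothetical.

```python
import numpy as np


def poly_kernel(A, B, degree=2):
    """kappa(a, b) = (a^T b)^degree for all column pairs of A (p x n) and B (p x m)."""
    return (A.T @ B) ** degree


def lkdl_preprocess(X_train, X_test, kernel, c, k, seed=0, tol=1e-10):
    """Return virtual train/test samples (each of size k x number of signals)."""
    rng = np.random.default_rng(seed)
    # Step 2: sub-sample c training signals (uniform, without replacement).
    idx = rng.choice(X_train.shape[1], size=c, replace=False)
    X_R = X_train[:, idx]
    # Steps 3-4: kernel matrices of Nystrom's method.
    C_train = kernel(X_train, X_R)            # N x c
    W = kernel(X_R, X_R)                      # c x c
    # Step 5: k largest eigenpairs of W (eigh returns ascending eigenvalues).
    evals, evecs = np.linalg.eigh((W + W.T) / 2)
    order = np.argsort(evals)[::-1][:k]
    S_k, V_k = evals[order], evecs[:, order]
    # Pseudo-inverse square root: invert only eigenvalues above the threshold.
    inv_sqrt = np.zeros_like(S_k)
    keep = S_k > tol * S_k.max()
    inv_sqrt[keep] = 1.0 / np.sqrt(S_k[keep])
    mapping = inv_sqrt[:, None] * V_k.T       # (S_k^dagger)^{1/2} V_k^T
    # Steps 6-8: virtual samples for training and test data.
    F_train = mapping @ C_train.T
    F_test = mapping @ kernel(X_test, X_R).T
    return F_train, F_test


rng = np.random.default_rng(0)
X_train = rng.standard_normal((6, 40))
X_train /= np.linalg.norm(X_train, axis=0)
X_test = rng.standard_normal((6, 5))
X_test /= np.linalg.norm(X_test, axis=0)

F_train, F_test = lkdl_preprocess(X_train, X_test, poly_kernel, c=10, k=8)
print(F_train.shape, F_test.shape)
```

The outputs can be passed unchanged to any linear dictionary learning routine, since they are ordinary real-valued column vectors of dimension k.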


E. Relation to Past Work 

The existing works on kernel sparse representations can 
be roughly divided into two categories. The first corresponds
to ‘analytical’ methods that operate solely in the feature
domain and use the kernel trick to find an analytical solution,
be it sparse coding or dictionary update 12^, [24], [28],
ED. The other category refers to ‘empirical’ or ‘approximate’
methods that operate in the input space, while making some 
approximation or assumption regarding the mapped signals in 
feature space, in order to alleviate some of the constraints 
when working with kernels ED, Ea, ES- Naturally, our 
work belongs to the second group of contributions. 


In 2002, Bengio et al. ED kernelized the matching pursuit 
algorithm by using the kernel empirical map of the input 
training examples as dictionary atoms. By referring to the 
kernel empirical map $\Phi_e$ instead of the actual mapped signals
in $\mathcal{F}$, the authors could perform standard linear matching
pursuit without having to rewrite the algorithm in terms of 
inner products. In this case, the constraint of a p.s.d kernel 
was no longer mandatory. In 2005 ES, a similar concept 
of embedding the signals to a kernel empirical map was 
used to kernelize the basis pursuit algorithm. This approach 
of working in the input domain with an approximation of 
the kernel feature space is very similar to ours and can be 
described by the following embedding, evaluated over the 
entire training dataset $\{x_i\}_{i=1}^{N}$:

$$x \mapsto \Phi_e(x) = [\kappa(x_1, x), \dots, \kappa(x_N, x)]^{T}. \qquad (26)$$

The case in our algorithm, where all the training signals are
involved in the approximation of the kernel matrix ($c = N$,
$C = W = K$), results in a similar expression for the
virtual samples:

$$F = (S^{1/2})^{\dagger} V^{T} K, \qquad (27)$$

where S and V are the eigenvalues and eigenvectors of the
matrix K. The embedding in this case is thus

$$\Phi_e(x) = (S^{1/2})^{\dagger} V^{T} [\kappa(x_1, x), \dots, \kappa(x_N, x)]^{T}. \qquad (28)$$

Contrary to ED, [22], our embedding preserves the similarities
in the high-dimensional feature space, represented by the
inner products, i.e.,

$$\Phi_e(x)^{T} \Phi_e(x') \approx \kappa(x, x') = \Phi(x)^{T} \Phi(x'), \qquad (29)$$

where we have used the expression $K^{\dagger} = V S^{\dagger} V^{T}$. In
addition, both ED and [22] focus on sparse coding only and
do not address the accuracy of the kernel empirical map, nor 
its dimension, which can be highly restrictive in large-scale 
datasets. 
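
The difference between the plain kernel empirical map (26) and the scaled embedding (28) can be checked numerically. The sketch below is illustrative: it uses random unit-norm toy signals, a polynomial kernel of order 2, and a relative eigenvalue threshold of our own choosing to form the pseudo-inverse. Only the scaled embedding reproduces the kernel values through linear inner products, as stated in (29).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 12))
X /= np.linalg.norm(X, axis=0)
K = (X.T @ X) ** 2                     # kernel matrix, here c = N and C = W = K

# Empirical kernel map (26): column i holds [kappa(x_1, x_i), ..., kappa(x_N, x_i)].
Phi_emp = K

# Scaled embedding (28): (S^{1/2})^dagger V^T applied to the same kernel vectors.
evals, V = np.linalg.eigh(K)
inv_sqrt = np.zeros_like(evals)
keep = evals > 1e-10 * evals.max()     # assumed threshold for the pseudo-inverse
inv_sqrt[keep] = 1.0 / np.sqrt(evals[keep])
Phi_lin = (inv_sqrt[:, None] * V.T) @ K

gram_emp = Phi_emp.T @ Phi_emp         # equals K @ K, not K
gram_lin = Phi_lin.T @ Phi_lin         # equals K K^dagger K = K

print("max deviation, empirical map :", np.abs(gram_emp - K).max())
print("max deviation, scaled mapping:", np.abs(gram_lin - K).max())
```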

Both Gao et al. in 2010 and Li et al. in 2011
ED proposed an analytical approach of kernelizing the basis
pursuit and orthogonal matching pursuit algorithms. Contrary
to ED and [22], the authors replaced all the inner products
by kernels and worked entirely in the feature domain.
Classification of faces and objects was achieved using an
approach similar to that of the SRC algorithm ED. Aside from
kernelizing the SRC algorithm, EJI also suggested updating
the dictionary one atom at a time. By zeroing the derivative 
of the optimization function with respect to each atom, the 
authors acquired in the same term, a mixture of both the atom 
itself and its kernel with the input examples. As the resulting 
equation could not be solved analytically, an iterative fixed 
point update was implemented. 

In 2012 Zhang et al. E5l provided an alternate approach 
of kernelizing the SRC algorithm. Instead of working with 
the implicit mapped signals in the feature space $\Phi(y)$, the
authors performed dimensionality reduction first, using the
KPCA algorithm, then fed the resulting nonlinear features
to a linear $\ell_1$ basis pursuit solver. It can be shown that
kernel PCA eventually entails the eigendecomposition of the
kernel matrix (more accurately, the centered kernel matrix), as
does our algorithm. The difference is that our method, apart 
from providing an accurate kernel mapping which preserves 
similarities in feature space, also avoids dealing with the kernel 
matrix altogether in the training stage, making it possible to 
work with large datasets. 

V. Experimental Results 

In the following section we highlight the three main benefits
of incorporating LKDL with existing DL: (1) improvement
in discriminability, which results in better classification, (2)
a small added computational effort by LKDL in comparison
with typical kernel methods, and (3) the ability to incorporate
LKDL seamlessly in virtually any existing linear DL
algorithm, contributing to more compact dictionaries and sparse
representations.

A. Unsupervised Dictionary Learning 

In this part we demonstrate the performance of our
algorithm in digit classification on the USPS and MNIST
databases. Our method of classification consists of first
pre-processing the training and test data using LKDL, then
performing regular, standard dictionary learning using existing
tools, and finally deploying the classification scheme in Section
IV-D. For sparse coding and dictionary learning, we use the
batch-OMP and efficient-KSVD implementations from the
latest OMP-Box (v10) and KSVD-Box (v13) libraries [52].
During all experiments we use the KKSVD algorithm explained in
Section III-B [28], [30] as our reference, in addition to regular
linear KSVD. We use the original code of Nguyen’s KKSVD.

A fair comparison in accuracy and runtime between LKDL
and KKSVD can be made, as KKSVD uses the same functions
from the OMP and KSVD libraries mentioned earlier. The
k-means and coreset sampling techniques were also adopted
from existing code. All of the tests were performed on a
64-bit Windows 7 machine with an Intel(R) Core(TM) i7-4790K
CPU and 16GB of memory. The initial dictionary is a random
subset of m columns from the training set in each class.

1) USPS dataset: The USPS dataset consists of 7,291
training and 2,007 test images of digits of size 16 × 16. All
images are stacked as vectors of dimension p = 256 and
normalized to unit $\ell_2$ norm. Following the experiment in lIMIl,
we choose the following parameters: 300 dictionary atoms per
class, cardinality of 5 and 5 iterations of DL. The chosen
kernel is polynomial of order 4, i.e. $\kappa(x, x') = (x^{T} x')^{4}$.
The approximation parameters were chosen empirically using
coarse-to-fine search and were set to: c = 20% of N training
samples and k = 256, the original dimension of the digits.
The displayed results are an average of 10 repeated runs
with different initialization of the sub-dictionaries and different
sampled columns $X_R$ in Nystrom’s method.

First we evaluate the quality of the representation of the
kernel matrix using Nystrom’s method. We randomly choose 

^Found in http://www.cs.technion.ac.il/~ronrubin/software.html

^Found in http://www.umiacs.umd.edu/~hien/KKSVD.zip

^K-means - http://www.mathworks.com/matlabcentral/fileexchange/31274-fast' 

^Coreset - http://web.media.mit.edu/~michaf/index.html


2,000 samples from USPS and approximate the resulting
kernel matrix. In order to isolate the effect of column
sub-sampling, we do not perform additional dimensionality
reduction using eigen-decomposition and thus choose k = 256. Five
sampling techniques were examined: uniform [44], diagonal
HSll, column-norm EZl, k-means [50] and coreset ED. We
also added the ideal reconstruction using rank-c SVD
decomposition, which is optimal with respect to minimizing the
approximation error, but takes much longer to compute. We
perform the comparison using the normalized approximation
error:

$$\mathrm{err} = \frac{\|K - \hat{K}\|_F}{\|K\|_F}, \qquad (30)$$

where K is the original kernel matrix and $\hat{K}$ its Nystrom
approximation. Fig. 1a shows the quality of the approximation
versus the c/N ratio, the percent of samples chosen for the
Nystrom approximation. As expected, SVD performs the best,
as it is meant exactly for the purpose of providing the ideal
rank-c approximation of K. The second best approximation is
obtained by k-means, which provides 98.5% accuracy in terms
of the normalized approximation error, with only 10% of the
samples. All other methods perform roughly the same. The
differences in approximation quality shrink as the percent of
chosen samples grows to half of the input dataset.
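
The error measure (30) and the comparison against the ideal rank-c reconstruction can be sketched as follows. The data here are random unit-norm vectors rather than USPS digits, uniform sampling is used for the Nystrom columns, and the function names are hypothetical; the sketch only illustrates how the two curves of such a comparison are obtained.

```python
import numpy as np


def normalized_error(K, K_hat):
    """err = ||K - K_hat||_F / ||K||_F, as in (30)."""
    return np.linalg.norm(K - K_hat, "fro") / np.linalg.norm(K, "fro")


def nystrom_approx(K, idx):
    """K_hat = C W^dagger C^T built from the sampled columns idx."""
    C = K[:, idx]
    W = K[np.ix_(idx, idx)]
    return C @ np.linalg.pinv(W) @ C.T


def best_rank_approx(K, rank):
    """Ideal rank-`rank` reconstruction of a symmetric p.s.d. matrix."""
    evals, V = np.linalg.eigh(K)
    top = np.argsort(evals)[::-1][:rank]
    return (V[:, top] * evals[top]) @ V[:, top].T


rng = np.random.default_rng(0)
X = rng.standard_normal((20, 300))
X /= np.linalg.norm(X, axis=0)
K = (X.T @ X) ** 4                    # polynomial kernel of order 4

c = 30                                # c/N = 10%
idx = rng.choice(K.shape[0], size=c, replace=False)
err_nystrom = normalized_error(K, nystrom_approx(K, idx))
err_optimal = normalized_error(K, best_rank_approx(K, c))
print(f"Nystrom (uniform): {err_nystrom:.3f}, rank-c optimum: {err_optimal:.3f}")
```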

Next we examine the effect of sub-sampling on the
classification accuracy of the entire database of USPS. Fig. 1b
shows the classification accuracy as a function of c/N, along
with the constant results of linear KSVD and KKSVD (which
do not depend on c). There is a gap of 1% between the
results of linear KSVD and its kernel variants, which suggests
that kernelization improves the discriminability of the input
signals. It can be seen that k-means sampling again performs
best and reaches the classification accuracy of KKSVD with only
a fraction of the samples. In general, the percent of samples
in the Nystrom approximation does not have much impact on the
final classification accuracy (apart from small fluctuations that
arise from the randomness of each run). This can be explained
by the simplicity of the digit images and the relatively large
number of training examples.

Following Nguyen’s setup in [28] and [30], we inspect the
effect of corrupting the test images with white Gaussian noise
and missing pixels. We use the same parameters as before
and repeat the experiment 10 times with different random
corruptions. The results of classification accuracy versus the
standard deviation of the noise and the percent of missing
pixels are given in Fig. 2a and 2b. It is evident that adding the
kernel improves the robustness of the classification to both noise
and missing pixels. The performance of LKDL follows that of
KKSVD with only 20% of the training samples. The trend
shown in our results is similar to that in EQ!, although the
results are slightly lower. This can be explained by the fact
that in [30], the authors did not use the traditional partitioning
of training and test data of the USPS dataset. In this simulation,
the coreset sampling technique was the best in dealing with
both types of corruption, which is the reason it is the only
method shown.
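
For reference, the two corruption types can be simulated as in the sketch below. The details are assumptions on our part: missing pixels are modelled by setting a random subset of entries in each image to zero, the noise is added directly to the unit-norm vectors, and the helper names are hypothetical.

```python
import numpy as np


def add_gaussian_noise(X, sigma, rng):
    """Add white Gaussian noise with standard deviation sigma to every entry."""
    return X + sigma * rng.standard_normal(X.shape)


def drop_pixels(X, fraction, rng):
    """Zero out a random `fraction` of the entries in every column (image)."""
    X_out = X.copy()
    p, n = X.shape
    n_drop = int(round(fraction * p))
    for j in range(n):
        X_out[rng.choice(p, size=n_drop, replace=False), j] = 0.0
    return X_out


rng = np.random.default_rng(0)
X_test = rng.random((256, 10)) + 0.1       # stand-in for 16 x 16 test digits
X_test /= np.linalg.norm(X_test, axis=0)

X_noisy = add_gaussian_noise(X_test, sigma=0.3, rng=rng)
X_missing = drop_pixels(X_test, fraction=0.3, rng=rng)
```

Only the test images are corrupted; the dictionaries and the Nystrom mapping are still learned from clean training data.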
Fig. 1. Approximation error (a) and classification accuracy (b) as a function of c/N, percent of samples used in Nystrom method.

Fig. 2. Classification accuracy in the presence of Gaussian noise (a) and missing pixels (b).
2) MNIST dataset: Next we demonstrate the differences in
runtime between our method and KKSVD using the
larger-scale digit database of MNIST, which consists of 60,000
training and 10,000 test images of digits of size 28 × 28. As
before, the digits were stacked in vectors of dimension
p = 784 and normalized to unit $\ell_2$ norm. We examine
the influence of gradually increasing the training set on the
classification accuracy and training time of the input data.
In this simulation, the entire training set of 60,000 examples
is reduced by randomly choosing a defined fraction of the
samples, while leaving the test set untouched. The runtime
measured in LKDL includes the time needed to prepare both
the training and test virtual samples, along with training the
entire input dataset using linear KSVD. As for KKSVD, the
runtime includes the preparation of the kernel sub-matrices
for each class and the kernel DL using KKSVD. Parameters
in the simulation were: 2 DL iterations, cardinality of 11, 700
atoms per digit, polynomial kernel of order 2, c = 15% and
k = 784. The results were averaged over 5 runs.

The results can be seen in Fig. 3a and 3b. Again, the coreset
sampling method was chosen, as it provided the best results.
The accuracy of LKDL is comparable to that of KKSVD,
though slightly worse due to the approximation, and still better
than the linear version of KSVD. The runtime of LKDL follows
that of KSVD, with an added component for calculating the
virtual datasets. This is expected since our method
“piggybacks” on KSVD’s performance and complexity. KKSVD’s
runtime, however, depends quadratically on the number of
input samples in each class. When the database is large,
the calculation of the virtual datasets (which is performed only
once) is negligible compared with the alternative of performing
kernel sparse coding thousands of times during the DL process.


Note that we chose a relatively small number of DL 
iterations in order to reduce the already-long computation time 
of KKSVD. A larger number of DL iterations will lead to 
an even greater difference in runtime between KKSVD and 
LKDL. For training the entire database of MNIST, LKDL is
19 times faster than KKSVD.

Fig. 3. Accuracy (a) and total training time (b) versus the number of input training examples in MNIST database. Runtime is shown in logarithmic scale.


B. Supervised Dictionary Learning 

In the following set of experiments we demonstrate the
ease of combining our pre-processing stage with any DL
algorithm, in particular LC-KSVD IITSl and FDDL IflTl,
both of which are supervised dictionary learning techniques
that were mentioned earlier. We do so using the original code
of LC-KSVD and FDDL. Throughout all tests, the training
and test sets were pre-processed using LKDL to produce
virtual training and test sets, which were then fed as
input to the DL and classification stages of each method.
In all experiments, no code has been modified, except for
exterior parameters which can be tuned to provide better
results. The point of this setup is to take an existing technique
of supervised DL and show the improvement that our method
can provide.

1) Evaluation on the USPS Database: We start by
comparing the classification accuracy on the same database as
before, USPS. First we perform regular FDDL with the
following parameters: 5 DL iterations and 300 dictionary atoms
per class, where the dictionary is first initialized using
K-means clustering of the training examples. The scalars
controlling the tradeoff in the DL and optimization expressions
remained the same as in the demo provided by the authors:
$\lambda_1 = 0.1$, $\lambda_2 = 0.001$ and $g_1 = 0.1$, $g_2 = 0.001$ (in ifTTll,
these are referred to as $\gamma_1$, $\gamma_2$). As for LKDL pre-processing,
the chosen parameters were: polynomial kernel of degree 3,
K-means based sub-sampling of 20% of the training samples
(c/N = 0.2) and k = 256. All results were averaged over 10
runs with different initializations.

Table II shows the classification results with and without
LKDL. There is a clear improvement in the results when
adding LKDL as pre-processing. However, the obtained results
in this experiment are lower than those reported in ifTTll. This
can be explained by the fact that we used the original database 
of USPS, while the provided code had a demo intended for 
an extended translation-invariant version of USPS. In addition, 

^Found in http://www.umiacs.umd.edu/~zhuolin/LCKSVD/

^Found in http://www.vision.ee.ethz.ch/~yangme/database_mat/FDDL.zip


TABLE II

Classification accuracy of FDDL on the USPS digit database,
with and without LKDL pre-processing

Algorithm        Accuracy (%)
FDDL             95.79
FDDL + LKDL      96.03

the exterior parameters $\lambda_1, \lambda_2, g_1, g_2$ were tuned specifically
for the extended USPS, and thus may have provided worse results
in our case.

2) Evaluation on the Extended YaleB Database: Next, we
show the benefit of combining our method with LC-KSVD
on the “Extended YaleB” face recognition database, which
consists of 2,414 frontal images that were taken under varying
lighting conditions. There are 38 classes in YaleB and each
class contains roughly 64 images, which are split in half into
training and test sets, following the experiment described in
ifTSll. The original 192 × 168 images are projected to
504-dimensional vectors using a randomly generated constant
matrix from a zero-mean normal distribution. We use a dictionary
size of 570 (on average 15 atoms per class) and sparsity
factor of 30, same as in na. The kernel chosen for LKDL
was Gaussian of the form: $\kappa(x, x') = \exp(-\|x - x'\|_2^2 / 2\sigma^2)$,
where $\sigma = 1$. Due to the small size of the dataset, no
sub-sampling was performed and c was set to be the entire size of
the training set. The value of the parameter k (the dimension
of the signal after eigen-decomposition) was set to 400, as it
appeared that further dimensionality reduction of the already
reduced 504-dimensional vector improved the results. In order
to use the Gaussian kernel, the samples in the training and
test sets were $\ell_2$ normalized, thus the original parameters
$\sqrt{\alpha}$ and $\sqrt{\beta}$ in expression ® had to be changed from 4 and
2 to 1/30 and 1/91 correspondingly. These parameters were
chosen using a coarse-to-fine search and provided the best
classification results. We use the original classification scheme

in El, El-

Table III shows the classification results of LC-KSVD1
and LC-KSVD2, with and without LKDL pre-processing. It

TABLE III

Classification accuracy of LC-KSVD1 and LC-KSVD2 on the
Extended YaleB database, with and without LKDL
pre-processing

Algorithm           Accuracy (%)
LC-KSVD1            94.49
LC-KSVD1 + LKDL     96.33
LC-KSVD2            94.99
LC-KSVD2 + LKDL     96.33

is clear that the addition of the nonlinear kernel function
increases the discriminability of the input samples and improves
classification results by up to 1.8% and 1.3% in the case
of LC-KSVD1 and LC-KSVD2, respectively. In fact, it
appears that LKDL blurs the differences between these
two methods; that is, there is no preference as to whether
the classifier is learned separately or jointly with
the dictionary.

The improved discriminability of LKDL combined with
LC-KSVD, versus LC-KSVD alone, can be demonstrated by
inspecting the resulting sparse coefficients of the test set. In
Fig. 4 one can see the obtained sum of absolute values of the
sparse coefficient vectors of all 32 test samples from class ‘10’.
The ideal distribution of atoms chosen during sparse coding
should be concentrated around atoms $[139, \dots, 150]$, which
belong to class ‘10’. One can see that in the case of LC-KSVD,
there are a few “successful” atoms which largely contribute
to the reconstruction of the test samples, while in LC-KSVD
combined with LKDL, the contribution is distributed more
evenly between all of the atoms in that class. In addition,
LC-KSVD alone will often choose atoms not corresponding
to the given class, while with LKDL, the contribution of these
atoms is fairly small.

Next we explore the impact of LKDL on the size of the
learned dictionary. Fig. 5a shows the results of LC-KSVD1 and
LC-KSVD2, with and without LKDL, versus the average
number of dictionary atoms for each class. It is clear that LKDL
improves the results of both LC-KSVD1 and LC-KSVD2.
With the addition of LKDL, a smaller dictionary with 7 atoms
per person achieves the same results as LC-KSVD alone with
15 atoms per person. This gap in performance grows as the
size of the dictionary becomes smaller and reaches a 20%
difference for 1 atom per person. The conclusion is that a
more compact dictionary can be learnt using the combination
of LC-KSVD and LKDL, without compromising accuracy.

Fig. 5b shows a similar experiment on the dependency
of classification accuracy on the sparsity factor, i.e. the number
of atoms used in the sparse reconstruction of a given signal. The
combination of LKDL and LC-KSVD with a sparsity of 15
achieves a better accuracy than that of LC-KSVD alone with a
sparsity of 30. From both these figures it can be seen that the
addition of LKDL can be helpful in reducing the complexity
of the DL problem, without compromising the accuracy.

3) Evaluation on the AR Face Database: The AR Face
database consists of 4,000 color images of faces belonging
to 126 classes. Each class consists of images taken over two
sessions, containing different lighting conditions, facial
variations and facial disguises (sunglasses and scarves). Following


TABLE IV

Classification accuracy of LC-KSVD1 and LC-KSVD2 on the
AR Face database, with and without LKDL pre-processing

Algorithm           Accuracy (%)
LC-KSVD1            92.5
LC-KSVD1 + LKDL     94
LC-KSVD2            93.7
LC-KSVD2 + LKDL     94.7

the experiment in M, 2,600 images were chosen: the first 50
classes of males and the first 50 classes of females. Out of 26
images in each class, 20 were chosen for training and the
rest for evaluation. We use the already-processed dataset in
CD, where the original images of size 165 × 120 pixels were
reduced to 540-dimensional vectors using random projection
as in Extended YaleB. The cardinality is set to 30, same as
before, and the number of atoms in DL is set to 500 (5
atoms per class). As before, we normalized all the signals to
unit $\ell_2$-norm. The parameters $\sqrt{\alpha}$ and $\sqrt{\beta}$ were determined
using coarse-to-fine 5-fold cross validation. We have noticed
that an optimal value of $\sqrt{\alpha}$ for LC-KSVD1 is not
necessarily as good for LC-KSVD2, thus we chose two sets
of parameters: $\sqrt{\alpha} = \sqrt{\beta} = 1/150$ for the optimal result of
LC-KSVD1 (the value of $\beta$ is not really used in LC-KSVD1),
and $\sqrt{\alpha} = \sqrt{\beta} = 1/120$ for LC-KSVD2.

In Table IV we compare the classification results of
LC-KSVD1 and LC-KSVD2, with and without LKDL
pre-processing. As can be seen, our method improves the
performance of LC-KSVD1 by 1.5% and LC-KSVD2 by 1%.

VI. Conclusion 

In this paper we have discussed some of the problems
arising when trying to incorporate kernels in DL, and paid
special attention to the kernel-KSVD algorithm by Nguyen et
al. [28], [30]. We proposed a novel kernel DL scheme, called
“LKDL”, which acts as a kernelizing pre-processing stage,
before performing standard DL. We used the concept of virtual
training and test sets and described the different aspects of
calculating these signals. We demonstrated in several experiments
on different datasets the benefits of combining our LKDL
pre-processing stage, both in accuracy of classification and in
runtime. Lastly, we have shown the ease of integrating
our method with existing supervised and unsupervised DL
algorithms. It is our hope that the proposed methodology
will encourage users to consider kernel DL for their tasks,
knowing that the extra effort involved in incorporating the
kernel layer is near-trivial. We intend to freely share the code
that reproduces all the results shown in this paper.

Our future research directions include combining LKDL 
with online DL. We would also like to examine the benefit 
of applying LKDL to the sparse coefficients instead of the 
input signals and maybe combining both options. Lastly, our 
goal is improving the sampling ratio, i.e. the size of the matrix 
C, using more advanced sampling techniques. 

^Found in http://www.umiacs.umd.edu/~zhuolin/LCKSVD/

Fig. 4. Upper row: sum of absolute values of sparse coefficient vectors (of size 570, the size of the dictionary) corresponding to test examples from class ‘10’
in Extended YaleB database. The columns from left to right represent LC-KSVD1 and LC-KSVD2 with and without the addition of LKDL pre-processing.
The additional colorbar features 38 bars which correspond to 38 classes in Extended YaleB. Bottom row: additional summation of the absolute values of
sparse coefficients in every class. As expected, the majority of nonzero values in all sparse coefficient vectors originate from class ‘10’.




Fig. 5. Dependence of accuracy on the average number of atoms per class (a) and the sparsity factor (b).